Online Job Scheduling in Distributed Machine Learning Clusters

نویسندگان

  • Yixin Bao
  • Yanghua Peng
  • Chuan Wu
  • Zongpeng Li
چکیده

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural network, multiple workers are run in parallel to train partitions of the input dataset, and update shared model parameters. In a shared cluster handling multiple training jobs, a fundamental issue is how to efficiently schedule jobs and set the number of concurrent workers to run for each job, such that server resources are maximally utilized and model training can be completed in time. Targeting a distributed machine learning system using the parameter server framework, we design an online algorithm for scheduling the arriving jobs and deciding the adjusted numbers of concurrent workers and parameter servers for each job over its course, to maximize overall utility of all jobs, contingent on their completion times. Our online algorithm design utilizes a primal-dual framework coupled with efficient dual subroutines, achieving good long-term performance guarantees with polynomial time complexity. Practical effectiveness of the online algorithm is evaluated using trace-driven simulation and testbed experiments, which demonstrate its outperformance as compared to commonly adopted scheduling algorithms in today’s cloud systems.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Online Scheduling of Jobs for D-benevolent instances On Identical Machines

We consider online scheduling of jobs with specic release time on m identical machines. Each job has a weight and a size; the goal is maximizing total weight of completed jobs. At release time of a job it must immediately be scheduled on a machine or it will be rejected. It is also allowed during execution of a job to preempt it; however, it will be lost and only weight of completed jobs contri...

متن کامل

Dynamic Virtual Machine Scheduling for Resource Sharing In the Cloud Environment

ABSTRACT: Resource allocation and job scheduling are the core functions of cloud computing. These functions are based on adequate information of available resources. Timely acquiring dynamic resource status information is of great importance in ensuring overall performance of cloud computing. A cloud system for analyzing performance, removing bottleneck, detecting fault, and maintaining dynamic...

متن کامل

Real-time Scheduling of a Flexible Manufacturing System using a Two-phase Machine Learning Algorithm

The static and analytic scheduling approach is very difficult to follow and is not always applicable in real-time. Most of the scheduling algorithms are designed to be established in offline environment. However, we are challenged with three characteristics in real cases: First, problem data of jobs are not known in advance. Second, most of the shop’s parameters tend to be stochastic. Third, th...

متن کامل

Using the black-box approach with machine learning methods in order to improve job scheduling in GRID environments

This article focuses on mapping jobs to resources with use of off-the-shelf machine learning methods. The machine learning methods are used in the black-box manner, having a wide variety of parameters for internal cross validation. In the article we focus on two sets of experiments, both constructed with a goal of congesting the system. In the first part, the machine learning methods are used a...

متن کامل

Two meta-heuristic algorithms for parallel machines scheduling problem with past-sequence-dependent setup times and effects of deterioration and learning

This paper considers identical parallel machines scheduling problem with past-sequence-dependent setup times, deteriorating jobs and learning effects, in which the actual processing time of a job on each machine is given as a function of the processing times of the jobs already processed and its scheduled position on the corresponding machine. In addition, the setup time of a job on each machin...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1801.00936  شماره 

صفحات  -

تاریخ انتشار 2018